Normalize executor and utility query text with $N placeholders#51
Normalize executor and utility query text with $N placeholders#51iskakaushik wants to merge 8 commits intomainfrom
Conversation
Port pg_stat_statements' query normalization to pg_stat_ch so that captured queries never contain raw literal values (passwords, PII, tokens, etc.). Both ClickHouse and OTel exporters now receive normalized text like: SELECT * FROM users WHERE email = $1 AND age > $2 Instead of: SELECT * FROM users WHERE email = 'alice@example.com' AND age > 30 Implementation: - Add post_parse_analyze_hook to capture JumbleState at parse time - Port generate_normalized_query / fill_in_constant_lengths from pg_stat_statements (static functions, not exported by PG) - Store normalized text per-backend in TopMemoryContext, consumed by ExecutorEnd/ProcessUtility/emit_log_hook - Schema names, table names, column names, operators, and SQL structure are preserved; only literal constants are replaced Includes TAP test (027_query_normalization.pl) with 16 subtests covering strings, numbers, negatives, floats, IN lists, LIKE, BETWEEN, subqueries, multi-statement, INSERT/UPDATE/DELETE, and schema-qualified names.
There was a problem hiding this comment.
Pull request overview
This PR introduces literal-constant query normalization (replacing strings/numbers/etc. with $N placeholders) so that exported query text to ClickHouse and OpenTelemetry never contains raw parameter values, leveraging PostgreSQL’s JumbleState from post_parse_analyze_hook.
Changes:
- Add a
post_parse_analyze_hookto generate and stash a normalized query string during parse/analyze. - Port/implement pg_stat_statements-style query normalization utilities (
PschNormalizeQuery) to replace constants with$N. - Add a TAP test suite (
t/027_query_normalization.pl) covering a variety of constant forms and SQL shapes.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| t/027_query_normalization.pl | Adds integration coverage validating that exported query text contains placeholders and not raw literals. |
| src/hooks/query_normalize.h | Declares the normalization API used by hooks to parameterize query text. |
| src/hooks/query_normalize.cc | Implements normalization logic (ported from pg_stat_statements approach) to rewrite constants into $N. |
| src/hooks/hooks.cc | Installs/uses post_parse_analyze_hook and prefers normalized query text when populating events (including error events). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // post_parse_analyze_hook — normalize query text at parse time. | ||
| // The JumbleState (with constant locations) is only available here, so we | ||
| // must generate the normalized text now and stash it for ExecutorEnd. | ||
| static void PschPostParseAnalyze(ParseState* pstate, Query* query, JumbleState* jstate) { | ||
| if (prev_post_parse_analyze != nullptr) { | ||
| prev_post_parse_analyze(pstate, query, jstate); | ||
| } | ||
|
|
||
| // Only normalize if enabled and the query has constants to replace | ||
| if (!psch_enabled || jstate == nullptr || jstate->clocations_count <= 0) { | ||
| return; | ||
| } | ||
|
|
||
| // Free any stale normalized text from a previous parse in this backend | ||
| ClearNormalizedQuery(); | ||
|
|
||
| const char* query_text = pstate->p_sourcetext; | ||
| int query_loc = query->stmt_location; | ||
| int query_len = query->stmt_len; | ||
|
|
||
| // CleanQuerytext extracts the relevant portion for multi-statement strings | ||
| query_text = CleanQuerytext(query_text, &query_loc, &query_len); | ||
|
|
||
| // Allocate in TopMemoryContext so the normalized text survives until | ||
| // ExecutorEnd consumes it (the parse-time context is short-lived). | ||
| MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext); | ||
| normalized_query = PschNormalizeQuery(query_text, query_loc, &query_len, jstate); | ||
| MemoryContextSwitchTo(oldcxt); | ||
| normalized_query_len = query_len; | ||
| } |
There was a problem hiding this comment.
normalized_query is a single per-backend buffer that can become stale or be overwritten before it is consumed. In particular: (1) PschPostParseAnalyze() can run for statements that later hit the early-return path in PschExecutorEnd() (e.g. queryId == 0), leaving normalized_query set; the next captured query will then incorrectly use this stale normalized text. (2) nested statements (SPI / functions) can overwrite normalized_query before the outer statement reaches ExecutorEnd, causing the outer event to lose normalization or pick up the inner query’s text. Consider storing normalized text in a stack keyed by nesting_level, or in a small per-backend map keyed by queryId/statement identity, and ensure the entry is removed on all completion paths (including the PschExecutorEnd() early return and utility paths).
src/hooks/hooks.cc
Outdated
| // Use normalized query if available; fall back to debug_query_string | ||
| if (normalized_query != nullptr) { | ||
| size_t len = Min(static_cast<size_t>(normalized_query_len), sizeof(event.query) - 1); | ||
| memcpy(event.query, normalized_query, len); | ||
| event.query[len] = '\0'; | ||
| event.query_len = static_cast<uint16>(len); | ||
| } else if (debug_query_string != nullptr && debug_query_string[0] != '\0') { | ||
| event.query_len = | ||
| static_cast<uint16>(CopyTrimmed(event.query, PSCH_MAX_QUERY_LEN, debug_query_string)); | ||
| } |
There was a problem hiding this comment.
CaptureLogEvent() copies from normalized_query but never frees/clears it. If an error occurs before the normal query event is built (or if the statement never reaches ExecutorEnd), this leaks TopMemoryContext memory and can cause later events to reuse the wrong normalized text. After copying into event.query, clear the stored normalized buffer (or pop the current entry if switching to a per-statement stack/map).
src/hooks/query_normalize.cc
Outdated
| // A constant is at least 1 byte; $N is at most 11 bytes (INT_MAX). | ||
| norm_query_buflen = query_len + (jstate->clocations_count * 10); | ||
| norm_query = static_cast<char*>(palloc(norm_query_buflen + 1)); | ||
|
|
||
| for (int i = 0; i < jstate->clocations_count; i++) { | ||
| int off; | ||
| int tok_len; | ||
|
|
||
| // If we have an external param at this location but no squashed lists, | ||
| // skip it so the original $N text is preserved. | ||
| if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists) { | ||
| continue; | ||
| } | ||
|
|
||
| off = jstate->clocations[i].location; | ||
| off -= query_loc; | ||
| tok_len = jstate->clocations[i].length; | ||
|
|
||
| if (tok_len < 0) { | ||
| continue; // duplicate, ignore | ||
| } | ||
|
|
||
| // Copy the chunk between last constant and this one | ||
| len_to_wrt = off - last_off - last_tok_len; | ||
| Assert(len_to_wrt >= 0); | ||
| memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt); | ||
| n_quer_loc += len_to_wrt; | ||
|
|
||
| // Insert $N placeholder (and squashed-list comment if applicable) | ||
| n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s", | ||
| num_constants_replaced + 1 + jstate->highest_extern_param_id, | ||
| jstate->clocations[i].squashed ? " /*, ... */" : ""); | ||
| num_constants_replaced++; | ||
|
|
||
| quer_loc = off + tok_len; |
There was a problem hiding this comment.
norm_query_buflen sizing (query_len + clocations_count * 10) is not a guaranteed upper bound once you account for the optional squashed-list suffix (" /*, ... */") and larger $N values. Combined with the use of sprintf(), this can lead to a write past the end of norm_query before the later Assert() triggers. Use a dynamically-growing buffer (e.g., StringInfo) or compute a strict upper bound and write with snprintf() using the remaining capacity, erroring out safely if it would overflow.
t/027_query_normalization.pl
Outdated
| ); | ||
|
|
||
| # Helper: run a query, flush, and return the captured query text from ClickHouse. | ||
| # Uses queryid to identify the specific query. |
There was a problem hiding this comment.
The helper comment says it "Uses queryid to identify the specific query", but get_captured_query() actually selects the most recent non-extension query by timestamp (no queryid filtering). Either update the comment to match the implementation or add a queryid filter (e.g., capture pg_stat_activity.query_id/pg_query_id() equivalent) to make the selection deterministic when multiple statements are exported close together.
| # Uses queryid to identify the specific query. | |
| # Clears prior exported rows and returns the most recent non-extension query by timestamp. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static void FillInConstantLengths(JumbleState* jstate, const char* query, int query_loc) { | ||
| LocationLen* locs; | ||
| core_yyscan_t yyscanner; | ||
| core_yy_extra_type yyextra; |
There was a problem hiding this comment.
core_yy_extra_type yyextra is uninitialized when passed to scanner_init(). If scanner_init() doesn’t fully overwrite all fields, this risks undefined behavior. Consider zero-initializing yyextra before calling scanner_init().
| core_yy_extra_type yyextra; | |
| core_yy_extra_type yyextra = {}; |
src/hooks/hooks.cc
Outdated
| // The normalized registry is keyed by statement identity and reused across | ||
| // repeated executions of cached plans, so this helper first looks up the | ||
| // normalized entry stashed at parse time. If no match exists, it falls back to | ||
| // CopyRawStatementText, which preserves the literal SQL text for the current | ||
| // statement only. | ||
| static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location, | ||
| int stmt_len) { | ||
| if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query, | ||
| sizeof(event->query), &event->query_len, query_text, | ||
| stmt_location, stmt_len, false)) { |
There was a problem hiding this comment.
CopyQueryText() uses consume=false when calling PschCopyNormalizedQueryForStatement(), so normalized entries remembered at parse time are never removed on the successful path. Since those entries live in TopMemoryContext and are keyed by raw source_text, sessions that execute many literal-variant statements can accumulate entries and grow backend memory without bound. Consider consuming/removing entries after enqueue (when safe) and/or adding an eviction/size cap (or keying by queryId) to prevent unbounded growth.
| // The normalized registry is keyed by statement identity and reused across | |
| // repeated executions of cached plans, so this helper first looks up the | |
| // normalized entry stashed at parse time. If no match exists, it falls back to | |
| // CopyRawStatementText, which preserves the literal SQL text for the current | |
| // statement only. | |
| static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location, | |
| int stmt_len) { | |
| if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query, | |
| sizeof(event->query), &event->query_len, query_text, | |
| stmt_location, stmt_len, false)) { | |
| // The normalized entry is only needed until it has been copied into the event | |
| // buffer, so consume it on a successful lookup to avoid unbounded retention in | |
| // the per-backend registry. If no match exists, fall back to the raw statement | |
| // text for the current statement only. | |
| static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location, | |
| int stmt_len) { | |
| if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query, | |
| sizeof(event->query), &event->query_len, query_text, | |
| stmt_location, stmt_len, true)) { |
src/hooks/hooks.cc
Outdated
| if (PschCopyNormalizedQueryForLog(&backend_state.normalized_queries, event.query, | ||
| sizeof(event.query), &event.query_len, debug_query_string, | ||
| pgstat_get_my_query_id(), edata->cursorpos)) { | ||
| } else if (debug_query_string != nullptr && debug_query_string[0] != '\0') { | ||
| event.query_len = |
There was a problem hiding this comment.
If PschCopyNormalizedQueryForLog() can’t find a match, the code falls back to copying debug_query_string verbatim into the event, which can still include raw literals (passwords/PII). This contradicts the PR description’s claim that “no raw parameter values are ever captured”. If the intent is a hard guarantee, consider emitting an empty/"" query (or gating raw capture behind a GUC) when normalization lookup fails.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| PschNormalizedQueryEntry* next; | ||
| }; | ||
|
|
||
| PschStatementKey PschMakeStatementKey(const char* source_text, int stmt_location, int stmt_len) { |
There was a problem hiding this comment.
SourceTextMatch() treats pointer equality (entry->source_text == source_text) as an unconditional match. Postgres can reuse memory addresses for new query strings across statements (e.g., after memory context resets), so pointer equality alone can incorrectly match a different statement and return the wrong normalized query. Consider removing the pointer fast-path, or at least validating length/content even when pointers match (e.g., compare lengths and memcmp against source_text_copy).
| PschNormalizedQueryEntry* next; | |
| }; | |
| PschStatementKey PschMakeStatementKey(const char* source_text, int stmt_location, int stmt_len) { |
src/hooks/hooks.cc
Outdated
| static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location, | ||
| int stmt_len) { | ||
| const PschStatementKey statement_key = PschMakeStatementKey(query_text, stmt_location, stmt_len); | ||
| if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query, | ||
| sizeof(event->query), &event->query_len, statement_key, | ||
| false)) { | ||
| return; | ||
| } | ||
|
|
||
| CopyRawStatementText(event, query_text, stmt_location, stmt_len); |
There was a problem hiding this comment.
Normalized query entries are always copied with consume=false, and successful executor/utility paths never call ForgetNormalizedStatement(). This makes the per-backend normalized query registry grow without bound over long-lived sessions that execute many distinct statements, which can lead to backend memory bloat. Consider consuming entries by default after they’re used (and only retaining them for known-reuse cases like cached plans/SPI), or enforcing an upper bound/eviction strategy for the registry.
Summary
post_parse_analyze_hookquery normalization based on PostgreSQL'sJumbleState, portinggenerate_normalized_query/fill_in_constant_lengthsfrompg_stat_statementsExecutorEnd/ProcessUtilityusing exact statement-scoped state, including nested SPI reuse and multi-statement handlingemit_log_hook; error events are still captured, butevent.queryis left empty instead of doing fuzzy reconstructionTest plan
./scripts/run-tests.sh ../postgres/install_tap tap 008./scripts/run-tests.sh ../postgres/install_tap tap 010./scripts/run-tests.sh ../postgres/install_tap tap 022./scripts/run-tests.sh ../postgres/install_tap tap 027./scripts/run-tests.sh ../postgres/install_tap tap 028./scripts/run-tests.sh ../postgres/install_tap tap 029./scripts/run-tests.sh ../postgres/install_tap tap 030